Welcome to the Introduction to Social Network Analysis workshop! This file introduces you to the concept of social network analysis and its application to international law research. The workshop is designed for law students with some limited background in R. If you have no background in R, I highly recommend that you first take a basic introduction to R that covers vectors, lists, matrices, data wrangling, and data visualization. In this workshop, I assume that you have this basic knowledge of R.
The objectives of this workshop are:
At the end of the workshop, you should be able to adapt what you have learned to your own legal research.
First, create a new R Project on your computer, and make sure you know where it is saved: all of your outputs will be saved in this folder. Next, run the following lines of code to load the R packages needed for this workshop.
#if you get an error message, make sure these packages are installed (with install.packages()) before you load them
library(readxl) #to import data
library(tidyr) #for reshaping data
library(dplyr) #for data wrangling
library(knitr) #to show tables in R Markdown
library(ggplot2) #for data visualization
library(RColorBrewer) #color palettes for plots
library(igraph) #we will primarily use igraph package in this workshop
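If you are not sure which of these packages are already installed, a small check like the one below can tell you before you call library(). This is only a sketch: the names workshop_pkgs, is_installed and missing_pkgs are my own, and the install.packages() line is left commented out so that nothing is installed unless you choose to.

```r
# Packages used in this workshop (names taken from the library() calls above)
workshop_pkgs <- c("readxl", "tidyr", "dplyr", "knitr",
                   "ggplot2", "RColorBrewer", "igraph")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(workshop_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- workshop_pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```

An empty result means every package is available and the library() calls should run without errors.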
Second, import the data that we will use throughout the workshop; you can inspect it here. The data set comes from the DESTA (Design of Trade Agreements) Database, a project that provides comprehensive data on various types of preferential trade agreements (PTAs).
Now, you are ready to import the data to your RStudio workspace:
excel_file <- tempfile()
download.file(params$url, excel_file, mode = "wb") #params$url is the data URL set in the R Markdown parameters
pta_data <- read_excel(path = excel_file, sheet = 2)
pta_data$regioncon <- as.factor(pta_data$regioncon)
#inspect the data
str(pta_data) #look at the structure of the data
## tibble [18,329 × 16] (S3: tbl_df/tbl/data.frame)
## $ country1 : chr [1:18329] "Afghanistan" "Algeria" "Algeria" "Algeria" ...
## $ country2 : chr [1:18329] "India" "Egypt" "Ghana" "Guinea" ...
## $ iso1 : num [1:18329] 4 12 12 12 12 12 818 818 818 818 ...
## $ iso2 : num [1:18329] 356 818 288 324 466 504 288 324 466 504 ...
## $ number : chr [1:18329] "1" "2" "2" "2" ...
## $ base_treaty : num [1:18329] 1 2 2 2 2 2 2 2 2 2 ...
## $ name : chr [1:18329] "Afghanistan India" "African Common Market" "African Common Market" "African Common Market" ...
## $ entry_type : chr [1:18329] "base_treaty" "base_treaty" "base_treaty" "base_treaty" ...
## $ consolidated : num [1:18329] 0 0 0 0 0 0 0 0 0 0 ...
## $ year : num [1:18329] 2003 1962 1962 1962 1962 ...
## $ entryforceyear: num [1:18329] 2003 1963 1963 1963 1963 ...
## $ language : chr [1:18329] "English" "English" "English" "English" ...
## $ typememb : num [1:18329] 1 2 2 2 2 2 2 2 2 2 ...
## $ regioncon : Factor w/ 6 levels "Africa","Americas",..: 3 1 1 1 1 1 1 1 1 1 ...
## $ wto_listed : num [1:18329] 1 1 1 1 1 1 1 1 1 1 ...
## $ wto_name : chr [1:18329] "India Afghanistan" "African Common Market" "African Common Market" "African Common Market" ...
head(pta_data) #look at the first few rows of the data
## # A tibble: 6 × 16
## country1 country2 iso1 iso2 number base_treaty name entry_type consolidated
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 Afghanistan India 4 356 1 1 Afgh… base_trea… 0
## 2 Algeria Egypt 12 818 2 2 Afri… base_trea… 0
## 3 Algeria Ghana 12 288 2 2 Afri… base_trea… 0
## 4 Algeria Guinea 12 324 2 2 Afri… base_trea… 0
## 5 Algeria Mali 12 466 2 2 Afri… base_trea… 0
## 6 Algeria Morocco 12 504 2 2 Afri… base_trea… 0
## # … with 7 more variables: year <dbl>, entryforceyear <dbl>, language <chr>,
## # typememb <dbl>, regioncon <fct>, wto_listed <dbl>, wto_name <chr>
As you can see, the data set is large (18,329 rows and 16 variables). Exploring the data is a crucial first step in a quantitative approach: it helps us learn about the data and figure out what to do with it next. We start by looking for patterns in the data set. Two variables of interest here are the year and the region to which a PTA belongs. A simple way to explore them is to tabulate the data and then plot it using the ggplot2 package:
pta_data_count <- pta_data %>% group_by(year, regioncon) %>% count(name) #count the number of ties (country pairs) formed under each agreement, by year and region
pta_data_count <- pta_data_count %>% arrange(desc(n))
kable(head(pta_data_count), col.names = c("year", "region", "agreement name", "number of ties"),
caption = "Frequency of Ties Table")
| year | region | agreement name | number of ties |
|---|---|---|---|
| 1991 | Africa | African Economic Community | 1275 |
| 2000 | Intercontinental | Cotonou Agreement | 1140 |
| 1988 | Intercontinental | Global System of Trade Preferences (GSTP) | 1128 |
| 2018 | Africa | African Continental Free Trade Area (AfCFTA) | 946 |
| 1989 | Intercontinental | Lome IV | 816 |
| 2003 | Intercontinental | Cotonou Agreement Cyprus Czech Republic Estonia Hungary Latvia Lithuania Malta Poland Slovakia Slovenia accession | 760 |
kable(summary(pta_data_count), col.names = c("year", "region", "agreement name", "number of ties"),
caption = "A summary of the Frequency of Ties Data") #summary() provides a summary of the data
| year | region | agreement name | number of ties |
|---|---|---|---|
| Min. :1948 | Africa : 91 | Length:1116 | Min. : 1.00 |
| 1st Qu.:1992 | Americas :231 | Class :character | 1st Qu.: 1.00 |
| Median :1999 | Asia :140 | Mode :character | Median : 1.00 |
| Mean :1997 | Europe :260 | NA | Mean : 16.42 |
| 3rd Qu.:2006 | Intercontinental:382 | NA | 3rd Qu.: 6.00 |
| Max. :2021 | Oceania : 12 | NA | Max. :1275.00 |
The summary shows six region categories: Africa, Americas, Asia, Europe, Oceania, and Intercontinental. Most PTAs are intercontinental (382 agreements), while, not surprisingly, Oceania has only 12 intra-regional agreements, which reflects the small number of countries in that region.
Next, we are going to look at which regions tend to form multilateral PTAs and which tend to form bilateral PTAs:
#classify each agreement: a single tie (n == 1) means a bilateral treaty, more than one tie a multilateral treaty
pta_data_count <- pta_data_count %>% mutate(bilat = ifelse(n == 1, "bilateral", "multilateral"))
pta_data_count$bilat <- as.factor(pta_data_count$bilat)
region_count <- pta_data_count %>% group_by(regioncon, bilat) %>% count(regioncon)
ggplot(region_count, aes(x= regioncon, y = n, color = bilat, fill = bilat)) +
geom_col()
From this simple bar plot, we notice that countries in Africa prefer multilateral PTAs to bilateral ones, while countries in the Americas and Asia clearly prefer bilateral PTAs to multilateral ones. Europe and Oceania are almost equally split between bilateral and multilateral PTAs. Interestingly, inter-regional PTAs also show no clear preference between bilateral and multilateral agreements.
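The rule n == 1 used to label an agreement as bilateral rests on a simple count: if an agreement with k members is recorded as one row for every pair of members, it produces k(k-1)/2 ties, so only a two-member agreement yields exactly one tie. The sketch below works through this arithmetic. The helper name ties_per_agreement is hypothetical, and the assumption that every pair of members is listed may not hold for every agreement in DESTA.

```r
# Number of country pairs (ties) in an agreement with k members,
# assuming every pair of members appears as one row in the edgelist
ties_per_agreement <- function(k) choose(k, 2)

# Membership sizes chosen for illustration only
members <- c(2, 3, 10, 51)
tie_table <- data.frame(members = members,
                        ties = ties_per_agreement(members))
print(tie_table)
```

Under that assumption, an agreement with 51 members would account for 1275 ties, the same number as the largest entry in the frequency table above. This is why a handful of large multilateral agreements dominate the tie counts.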
ggplot(pta_data_count, aes(x = year, y = n, color = regioncon, fill = regioncon)) +
geom_col() + facet_wrap(~regioncon)
Indeed, plotting the number of ties formed each year by region confirms our earlier observation: Africa stands out in the number of ties formed, along with intercontinental PTAs, followed by Europe. Asia, the Americas and Oceania prefer to form bilateral agreements.
Having done a preliminary exploration of the data, one direction we can take is to look at the inter-regional network, because that is where PTAs have the most impact on inter-regional trade flows. What else could be explored further based on this exploratory session? This is food for thought, and an opportunity for you to try social network analysis on your own.
In this section, I will introduce how to create a network in R. There are two types of data structure from which you can construct a network: an edgelist and an adjacency matrix.
An edgelist is exactly how the DESTA database is organized: each row is a dyadic country relationship, given by the first two columns, country1 and country2, of the original data set:
head(pta_data)
## # A tibble: 6 × 16
## country1 country2 iso1 iso2 number base_treaty name entry_type consolidated
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 Afghanistan India 4 356 1 1 Afgh… base_trea… 0
## 2 Algeria Egypt 12 818 2 2 Afri… base_trea… 0
## 3 Algeria Ghana 12 288 2 2 Afri… base_trea… 0
## 4 Algeria Guinea 12 324 2 2 Afri… base_trea… 0
## 5 Algeria Mali 12 466 2 2 Afri… base_trea… 0
## 6 Algeria Morocco 12 504 2 2 Afri… base_trea… 0
## # … with 7 more variables: year <dbl>, entryforceyear <dbl>, language <chr>,
## # typememb <dbl>, regioncon <fct>, wto_listed <dbl>, wto_name <chr>
Although you do not need the other variables in the data set to construct a simple network, it is useful to keep them: as you will see later, they can be used to set the tie properties in the network.
Alternatively, you can also construct a network from an adjacency matrix as the underlying data structure.
#ignore these two lines for now; we will cover these functions later on
#the point is to show the alternative data structure available to construct a network, shown in the output here
temp_graph <- graph_from_edgelist(cbind(head(pta_data$country1), head(pta_data$country2)), directed = FALSE)
as_adjacency_matrix(temp_graph) #current igraph name for get.adjacency()
## 8 x 8 sparse Matrix of class "dgCMatrix"
## Afghanistan India Algeria Egypt Ghana Guinea Mali Morocco
## Afghanistan . 1 . . . . . .
## India 1 . . . . . . .
## Algeria . . . 1 1 1 1 1
## Egypt . . 1 . . . . .
## Ghana . . 1 . . . . .
## Guinea . . 1 . . . . .
## Mali . . 1 . . . . .
## Morocco . . 1 . . . . .
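To see how an edgelist and an adjacency matrix encode the same information, it can help to build a small matrix by hand. The following is a minimal base R sketch on a three-row toy edgelist; the object names edges, countries and adj are my own, and igraph does this conversion for you in practice.

```r
# A toy edgelist: each row is one undirected tie
edges <- cbind(c("Afghanistan", "Algeria", "Algeria"),
               c("India",       "Egypt",   "Ghana"))

# All countries that appear in either column, in order of first appearance
countries <- unique(as.vector(t(edges)))

# Start from an all-zero square matrix with countries as row and column names
adj <- matrix(0, nrow = length(countries), ncol = length(countries),
              dimnames = list(countries, countries))

# Mark each tie in both directions, because the network is undirected
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

print(adj)
```

Each row of the edgelist becomes two cells in the matrix, one on each side of the diagonal, which is why the adjacency matrix of an undirected network is symmetric.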
The graph_from_edgelist() function in the igraph package creates a network from edgelist data for you:
cbind(head(pta_data$country1), head(pta_data$country2))
## [,1] [,2]
## [1,] "Afghanistan" "India"
## [2,] "Algeria" "Egypt"
## [3,] "Algeria" "Ghana"
## [4,] "Algeria" "Guinea"
## [5,] "Algeria" "Mali"
## [6,] "Algeria" "Morocco"
edge_net <- graph_from_edgelist(cbind(head(pta_data$country1), head(pta_data$country2)),
directed = FALSE) #you have created your first network from an edgelist!
#see the network:
set.seed(123) #set.seed() fixes the random layout of the plot, for the sake of reproducibility and comparison
plot.igraph(edge_net) #don't worry about this function; we will cover it in the next section
To create a network from an adjacency matrix, call the function graph_from_adjacency_matrix() from the igraph package.
as_adjacency_matrix(edge_net) #current igraph name for get.adjacency()
## 8 x 8 sparse Matrix of class "dgCMatrix"
## Afghanistan India Algeria Egypt Ghana Guinea Mali Morocco
## Afghanistan . 1 . . . . . .
## India 1 . . . . . . .
## Algeria . . . 1 1 1 1 1
## Egypt . . 1 . . . . .
## Ghana . . 1 . . . . .
## Guinea . . 1 . . . . .
## Mali . . 1 . . . . .
## Morocco . . 1 . . . . .
matrix_net <- graph_from_adjacency_matrix(as_adjacency_matrix(edge_net), mode = "undirected")
set.seed(123) #again, set.seed() here makes sure the plot uses the same layout
plot.igraph(matrix_net)
As you can see, the two networks look exactly the same!
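Comparing two plots by eye works for a small network, but the same check can be done in code. The sketch below rebuilds both routes on a toy edgelist (made-up object names edges, g_edge and g_mat) and compares the results with functions from the igraph package.

```r
library(igraph)

# Toy edgelist, standing in for the first rows of the PTA data
edges <- cbind(c("Afghanistan", "Algeria", "Algeria"),
               c("India",       "Egypt",   "Ghana"))

# Route 1: build the network from the edgelist
g_edge <- graph_from_edgelist(edges, directed = FALSE)

# Route 2: build it from the adjacency matrix of the first network
g_mat <- graph_from_adjacency_matrix(as_adjacency_matrix(g_edge),
                                     mode = "undirected")

# Same number of nodes and ties?
print(c(vcount(g_edge), vcount(g_mat)))
print(c(ecount(g_edge), ecount(g_mat)))

# Same structure, regardless of how the nodes are ordered?
same_structure <- isomorphic(g_edge, g_mat)
print(same_structure)
```

isomorphic() only compares the structure of the two graphs, so it is a reasonable check here, where the point is that both data structures describe the same set of ties.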
Now, let’s construct the network of countries by their inter-regional PTAs, using the edgelist data already provided in the DESTA data set. Because the data are structured as an edgelist, we use graph_from_edgelist().
pta_intercon <- pta_data %>% filter(regioncon == "Intercontinental")
pta_intercon_net <- graph_from_edgelist(cbind(pta_intercon$country1, pta_intercon$country2), directed = FALSE)
Now that we have successfully created the network, the next step is to assign attributes to the ties for later use. Useful variables for the purpose of this workshop are name and year. The function E() returns the edge sequence of the network and can also be used to assign edge attributes, such as the properties and weights of the edges. Here, we call the edges of the network using E(), followed by $ and the name of the attribute we wish to create: agt_name for the PTAs’ names, and year for the year of these PTAs.
E(pta_intercon_net)$agt_name <- pta_intercon$name
E(pta_intercon_net)$year <- pta_intercon$year
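Assigning a whole column to E() works because graph_from_edgelist() creates the edges in the same order as the rows of the edgelist, so edge i corresponds to row i of the data. The toy example below illustrates this. The data frame toy and its values are made up, and igraph::as_data_frame() is called with its package prefix to avoid a clash with the function of the same name in dplyr.

```r
library(igraph)

# Toy data in the same shape as pta_intercon (values are made up)
toy <- data.frame(country1 = c("A", "A", "B"),
                  country2 = c("B", "C", "C"),
                  name     = c("PTA 1", "PTA 2", "PTA 3"),
                  year     = c(1990, 2001, 2015),
                  stringsAsFactors = FALSE)

toy_net <- graph_from_edgelist(cbind(toy$country1, toy$country2),
                               directed = FALSE)

# Edge i of the graph comes from row i of the edgelist,
# so the columns can be attached directly as edge attributes
E(toy_net)$agt_name <- toy$name
E(toy_net)$year     <- toy$year

# One row per edge, with its endpoints and attributes
edge_table <- igraph::as_data_frame(toy_net, what = "edges")
print(edge_table)
```

If you filter or reorder the data after building the network, the rows and edges no longer line up, so it is safest to assign attributes right after the network is created.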
Note that there may be duplicated edges between a pair of countries; this is the case when two countries have formed more than one PTA together. Instead of keeping duplicated edges, we can assign each pair a weight corresponding to the number of PTAs shared by the two countries. To do so, we first give every edge a weight of 1 and then use the simplify() function from the igraph package, which removes loops and multiple edges. In doing so, simplify() can assign the sum of the weights of all duplicated edges to the weight of the remaining edge.
#set weighted ties:
E(pta_intercon_net)$weight <- 1
pta_intercon_simp <- pta_intercon_net %>% simplify(edge.attr.comb = list(weight = "sum"))
It is also important to recognize that once we simplify the edges of the network, we lose some of the name and year information attached to those edges. The simplified network is useful for looking at the descriptives (covered in the next section). However, if the names or the years of PTAs are important to your analysis, you may have to revert to the full network.
Similarly, you can call the nodes of the network using the function V(), followed by $ and the name of the node attribute. There are 191 country nodes engaged in inter-regional PTAs. We will use this function a lot more later on.
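To make the effect of simplify() concrete, here is a small sketch on a toy network with made-up values. It also shows one possible variation that the workshop code does not use: telling simplify() to keep the earliest year of the merged edges by adding year = "min" to edge.attr.comb, with an unnamed "ignore" entry as the default for any other attribute.

```r
library(igraph)

# Toy network: A and B share two PTAs, B and C share one (made-up values)
edges <- cbind(c("A", "A", "B"),
               c("B", "B", "C"))
toy_net <- graph_from_edgelist(edges, directed = FALSE)
E(toy_net)$weight <- 1
E(toy_net)$year   <- c(1990, 2005, 2010)

# Sum the weights, keep the earliest year, and drop any other edge attribute
toy_simp <- simplify(toy_net,
                     edge.attr.comb = list(weight = "sum",
                                           year   = "min",
                                           "ignore"))

print(ecount(toy_simp))     # number of ties after merging duplicates
print(E(toy_simp)$weight)   # number of PTAs behind each tie
print(E(toy_simp)$year)     # earliest PTA year for each tie
```

The two A-B edges collapse into a single tie of weight 2. Whether the earliest year, the latest year, or no year at all is the right summary depends on your research question.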
head(V(pta_intercon_net)$name) #showing only the first 6 countries
## [1] "Egypt" "Jordan" "Morocco" "Tunisia" "Albania" "Turkey"
You can look at the basic information about the network by calling the function summary(). The first line of the output starts with IGRAPH, indicating the type of the object, followed by a unique code for the graph. Next come up to four letters, here UNW followed by a dash. U indicates that this network is an undirected graph, as opposed to D, a directed graph. N indicates that the vertex attribute name has been set. W indicates that this graph is weighted (it has a weight edge attribute). A fourth letter, B, would indicate a bipartite graph (one with the type vertex attribute set); the dash shows that it is absent here.
After these letters, the number of vertices (191) and the number of edges (9316) are reported. The second line lists the attributes of the graph, each with its kind (g for graph, v for vertex, and e for edge) and its type (c for character, n for numeric, l for logical, and x for other).
Calling the function print_all() shows similar information, along with the edges in the graph.
summary(pta_intercon_net) #print the number of vertices and edges, and whether the graph is directed or not
## IGRAPH cbadd1b UNW- 191 9316 --
## + attr: name (v/c), agt_name (e/c), year (e/n), weight (e/n)
#print_all(pta_intercon_net) #left out to save space, since it would print all of the edges
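Each piece of the summary line can also be retrieved individually, which is handy when you want to use the values in later code rather than read them off the screen. The sketch below uses a toy network (toy_net is a made-up name) and query functions from the igraph package.

```r
library(igraph)

# Toy undirected network with named vertices and a weight on every edge
toy_net <- graph_from_edgelist(cbind(c("A", "A", "B"),
                                     c("B", "C", "C")),
                               directed = FALSE)
E(toy_net)$weight <- 1

print(is_directed(toy_net))       # FALSE corresponds to the letter U
print(is_named(toy_net))          # TRUE corresponds to the letter N
print(is_weighted(toy_net))       # TRUE corresponds to the letter W
print(is_bipartite(toy_net))      # FALSE: no type vertex attribute, so no B
print(vcount(toy_net))            # number of vertices
print(ecount(toy_net))            # number of edges
print(vertex_attr_names(toy_net)) # attributes marked v in the summary
print(edge_attr_names(toy_net))   # attributes marked e in the summary
```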
Now that we have successfully created the network, in this section we will explore some network properties: the descriptives of the network. First, we will look at how to visualize the network. Thereafter, basic network descriptions such as density, degree distributions, and transitivity will be covered. Next, we will go over various types of network centralities to identify key players in the network.
The main function for network visualization in the igraph package is plot.igraph(), which offers many options for customizing your network visuals. Examples given here are vertex.size, vertex.color, vertex.label, vertex.label.cex, vertex.frame.color, and layout. Note that arguments with the vertex. prefix specify the shape, color, and size of the vertices. Similarly, if you want to customize the edges, you can do so with the edge. prefix, such as edge.color, or edge.width to specify the thickness of the edges. edge.width is often used for weighted graphs to show the different weights of the edges, as you shall see later on. For more options, see the R documentation for plot.igraph().
set.seed(123)
plot.igraph(pta_intercon_net,
vertex.size = 5,
vertex.color = "red",
vertex.label = V(pta_intercon_net)$name,
vertex.label.cex = .5,
vertex.frame.color = NA,
layout = layout_with_fr, #Fruchterman-Reingold layout (current name for layout.fruchterman.reingold)
main = "The network of inter-regional PTAs"
)
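As a small illustration of the edge. options described above, the sketch below draws a toy weighted network in which edge.width is taken from the weight attribute. The network and its weights are made up. Computing the layout once with layout_with_fr() and passing the resulting coordinates to plot() is one way to reuse exactly the same node positions across several plots.

```r
library(igraph)

# Toy weighted network (made-up weights standing in for numbers of shared PTAs)
toy_net <- graph_from_edgelist(cbind(c("A", "A", "B", "C"),
                                     c("B", "C", "C", "D")),
                               directed = FALSE)
E(toy_net)$weight <- c(1, 3, 2, 5)

# Compute the layout once, after fixing the seed, so it can be reused
set.seed(123)
coords <- layout_with_fr(toy_net)

plot(toy_net,
     layout = coords,
     vertex.size = 20,
     vertex.color = "red",
     vertex.frame.color = NA,
     edge.width = E(toy_net)$weight,  # thicker line = larger weight
     main = "Toy network with weighted edges")
```

For a network with large weights, you may want to rescale the weights before passing them to edge.width so that the thickest edges do not cover the plot.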